1  Introduction


Biological control systems do amazing things. In particular, biological organisms can coordinate
the motions of many degrees of freedom to accomplish a complex task with apparent
ease  and  can  readily  learn  new  patterns  of  coordination.  The  inherent  complexity  of  the 
dynamics  of  a  system  such  as  a  humanoid  performing  a  typical  behavior,  as  well  as  the  high 
dimensionality,  are  prohibitive  for  conventional  engineering  control  schemes.  These  methods 
generally  rely  on  having  a  detailed  model  of  the  system,  which  is  an  inconvenient  assumption 
in  complex  systems  like  a  humanoid.  Also,  traditional  methods  generally  assume  an  a  priori 
desired  trajectory,  which  then  raises  the  issue  of  path  planning.  The  goal  of  our  work  is  to 
develop  new  control  approaches  for  complex  systems  like  humanoids  to  produce  a  desired 
behavior. The desired behavior is specified as a natural verbal criterion such as “move forward
without falling over” for walking, or “execute a full twisting one-and-a-half somersault
dive” for platform diving. As the equations of motion for these systems are prohibitively
long and complex, we have used the SD/FAST software package (Symbolic Dynamics, Inc.
[19]), which uses Kane’s formulation, for our simulations.

Our  approach  for  dealing  with  the  complexity  of  the  control  problem  is  inspired  by 
biological  systems.  First,  our  controller  design  has  a  hierarchical  structure  which  simplifies 
the  control  task  at  each  level.  The  control  structure  is  hybrid  in  nature;  the  controllers  at 
each  level  of  the  hierarchy  compute  continuous  functions,  but  their  output  is  decoded  into 
discrete  control  actions,  which  then  act  on  the  continuous  dynamical  system.  As  typically 
studied,  hybrid  systems  generally  have  one  discrete  part  and  one  continuous  part,  whereas 
here  we  have  continuous  controllers  producing  discretized  controls  which  act  on  continuous 
systems.  Finally,  the  controllers  themselves  are  not  preconstructed,  but  rather  learn  which 
controls  should  be  applied  to  produce  the  desired  actions. 

Currently, our control system is open-loop. The system can, through experience and repetition,
build up an internal model relating the desired movements to the appropriate control
signals.  The  learning  controller  can  take  on  many  forms;  here,  we  have  chosen  radial  basis 
function  networks.  The  lower  hierarchical  levels  can  be  trained  by  a  modified  supervised 
learning algorithm, while the complexity of the task facing the higher levels will require a
reinforcement learning approach. For a more robust (and more realistic from a biological point
of  view)  system,  though,  feedback  will  be  required.  In  biological  systems  learning  complex 
new  tasks,  a  progression  from  a  tightly-regulated,  closed-loop  form  of  control  to  open-loop 
control  is  often  seen  as  the  feed-forward  controller  becomes  more  accurate.  Understanding 
control  systems  of  this  type  is  part  of  an  ongoing  project;  currently,  our  aim  is  to  understand 
the  feed-forward  portions  of  the  controls  and  how  they  can  be  learned. 

This  paper  is  organized  as  follows.  In  Section  2,  we  present  the  diving  problem  as 
an  example  of  the  type  of  problem  discussed  above.  We  describe  our  controller  design  in 
Section  3,  and  present  our  learning  algorithms  and  some  preliminary  results  in  Section  4. 

2  The Diving Problem

The  problem  on  which  we  are  testing  our  control  designs  is  that  of  a  human  platform  diver. 
The control problem for the diver is as follows: given fixed initial conditions (after leaving
the board), execute a certain maneuver (a full twisting one-and-a-half somersault, for
example),  and  then  enter  the  water  in  a  fully  extended,  vertical  position  (see  [2]).  There 
is  no  particular  desired  trajectory  specified.  This  problem  has  several  interesting  features. 
After  the  diver  has  left  the  board,  he  is  subject  to  angular  momentum  conservation,  which 
creates  a  nonholonomic  constraint.  The  diver  leaves  the  board  with  some  initial  (non-zero) 
angular  momentum,  however,  so  the  system  has  drift.  The  drift  velocity  depends  on  the 
configuration of the diver. Since the diver is falling while executing the maneuver, there is a
predetermined  length  of  time  in  which  the  controls  can  act.  Since  the  diver  generally  starts 
with  his  momentum  totally  in  the  somersault  direction,  he  needs  to  execute  a  “throwing” 
maneuver with his arms to initiate twisting (see [9]). We are currently simulating a diver
with  ten  degrees  of  freedom  in  the  joints:  three  in  each  shoulder,  one  at  each  elbow,  and  one 
at  each  hip. 

Although  much  work  has  been  done  recently  on  the  control  and  steering  of  nonholonomic 
systems,  most  of  it  has  been  for  drift-free  systems  (for  a  survey,  see  [23]).  Some  specific 
cases with drift have been addressed (e.g., [6]; [24]; [10]), but very little work exists concerning
general systems with drift. Other control approaches to problems like this include
those of Hodgins and Raibert [18] and Wooten and Hodgins [32], who divide complicated
movements  into  states  of  a  finite  state  machine;  within  each  state,  motions  are  regulated  by 
PD  controllers.  There  are  some  learning  approaches  to  problems  of  this  type  in  the  literature 
as  well:  Gorinevsky,  Kapitanovsky,  and  Goldenberg  [13]  use  radial  basis  functions  to  learn 
the  controls  for  steering  a  space  platform  with  an  arm,  and  Bertsekas  and  Tsitsiklis  [3]  use 
neurodynamic  programming  to  learn  to  control  discrete  systems. 

We  have  tested  some  conventional  techniques  on  a  planar,  two-joint  simplification  of 
the  diver  model,  but  these  proved  unsatisfactory  even  for  the  simplified  system.  A  simple 
learning  algorithm  applied  to  the  planar  diver  was  more  promising  (see  [8]),  and  led  to  our 
continued  work  on  learning  controllers  described  here.  We  are  also  involved  in  an  effort  to 
develop  new  path-planning  methods  for  nonholonomic  systems  with  drift  [11]. 

3  Learning  Controller  Architecture 

In  designing  a  controller,  we  have  taken  some  inspiration  from  biological  systems,  the  original 
learning  controllers.  For  example,  biological  systems  deal  with  dynamic  complexity  through 
hierarchical organization, with different parts of the motor control system performing different
functions (see Figure 1). Our learning controller, shown schematically in Figure 2,
is  also  hierarchical  in  design.  The  coordinating  controller  and  the  single-degree-of-freedom 
(single-DOF) controllers all operate in the discrete-time domain, while the plant, a mechanical
system, operates in continuous time. The links between the regimes are provided by
the decoders, which convert the joint control signals into torques, and the encoder, which
converts the resulting motion into a movement parametrization. In general, this movement
representation could include parameters such as the total distance moved forward, the total
number of rotations about some axis, or the total movement time; the representation will depend on
the important features of the type of movement being controlled. The controllers together
with the decoder, the plant, and the encoder form a hybrid system.

Figure 1: Hierarchy in vertebrate motor control.

Figure 2: Schematic of a general hierarchical learning controller.

Figure 3: Closeup of a single-DOF controller.

3.1  Single-DOF  Controllers 

The single-DOF controllers take as input a vector in $\mathbb{R}^m$ from the higher-level coordinating
controller, together with some current joint sensor information, and produce an output vector
in $\mathbb{R}^n$ defining the joint torque profile. To design the joint control torque parametrization,
we again turn to biology for direction. In a behavioral task such as diving, controls which
produce a desired behavior are often nonunique, so some restriction of the allowed controls is
needed.  At  low  levels  of  the  control  hierarchy,  biological  systems  accomplish  this  restriction 
through  pattern  generators.  These  relatively  simple  neural  networks  produce  stereotypical 
bursting  patterns  which  can  be  tuned  by  descending  commands  (i.e.,  signals  from  higher 
hierarchical levels). It has been known for some time that rhythmic movements like walking,
swimming, breathing, and chewing are controlled in many animals by periodic pattern
generators  (for  a  review,  see  [15]).  More  recently,  several  investigators  have  found  evidence 
for  the  existence  of  low-level  controllers  in  fast,  goal-directed,  single-joint  movements  (see, 
for  example,  [14]).  In  the  model  proposed  by  Gottlieb,  Corcos,  and  Agarwal  in  [14],  the 
low-level controller is a pulse generator which produces square activation pulses as inputs to
the  motoneuron  pool.  Such  a  controller  would  produce  the  stereotypical  patterns  often  seen 
in fast, single-joint movements, which are identifiable by their torque, velocity, and double-
or triple-burst EMG profiles. Thus, higher levels in the biological control hierarchy can
choose,  via  tuning,  only  among  the  family  of  controls  put  out  by  the  pattern  generator.  The 
parametrization  of  possible  controls  also  provides  a  compact  representation  of  the  controls, 
which  allows  efficient  storage  and  communication  between  hierarchical  levels. 

The  single-DOF  controllers  and  decoders  we  have  chosen  (see  Figure  3)  play  the  role  of 
the pattern generators. They receive the desired change in joint angle and velocity and the
desired movement time, $\mathbf{v}_d = (\Delta\theta_d, \Delta\dot{\theta}_d, T_d)$, as tuning parameters from the coordinating
controller. The low-level controllers are required to compensate for some of the initial conditions
on the joint, so the single-DOF controller takes as parameters the initial velocity
of the DOF and the initial effective external torque acting on that DOF, $\mathbf{v}_s = (\dot{\theta}_0, \tau)$. We
can then consider each single-DOF controller to be an element of a two-dimensional space of
controllers indexed by $\mathbf{v}_s$. For each movement, the appropriate controller from this controller
space  is  selected  based  on  the  sensory  input.  These  sensory  signals  would  be  provided  by 
information  from  joint  sensors  analogous  to  the  stretch  receptors  and  Golgi  tendon  organs 
of  biological  systems. 

As  there  are  three  free  inputs,  the  control  family  should  have  three  specifiable  parameters, 
to  allow  sufficient  movement  richness  while  maintaining  uniqueness  of  the  controls.  Based  on 
the  model  above,  we  have  chosen  torque  profiles  consisting  of  two  square  pulses,  as  indicated 
in  Figure  3.  As  these  torque  profiles  have  four  obvious  parameters,  namely  the  pulse  heights 
and  widths  for  the  two  pulses,  we  use  another  idea  from  [14]  to  restrict  the  control  family. 
There, it is hypothesized that the motor control system uses two different control schemes in
different conditions, pulse height modulation (PHM) and pulse width modulation (PWM).
In our controller design, therefore, the PWM strategy, which we have chosen to apply for
movement times larger than some critical time $T_{crit}$, requires the pulses’ heights to be of equal
magnitude and opposite direction, while in PHM ($T_d < T_{crit}$), the pulses’ widths are equal.
Thus there are three control parameters to be specified in either control strategy: one pulse
height and two pulse widths in the PWM case, and one pulse width and two heights in the
PHM case. For PWM, $\mathbf{u} = (pw_1, pw_2, ph)$ (where the sign of the pulse widths indicates
the pulse direction), and for PHM, $\mathbf{u} = (ph_1, ph_2, pw)$. We have currently implemented
two separate controllers for each single-DOF controller, one for the PHM regime and one
for the PWM regime, for ease of learning. A switch based on the value of $T_d$ determines
which  controller  is  active.  The  decoder  interface  converts  the  output  vector  of  the  controller 
into  the  two-pulse  pattern,  which  is  reminiscent  of  both  bang-bang  control  and  the  EMG 
profile  mentioned  above.  In  the  future,  we  may  apply  a  filter  to  these  torques,  in  analogy  to 
the  filtering  action  of  the  motoneuron  pool,  which  would  produce  more  biologically  realistic 
torque profiles. The output of a single DOF in the plant is represented by a single-joint
encoder as a vector in $\mathbb{R}^m$, $\mathbf{v}_p = (\Delta\theta, \Delta\dot{\theta}, T)$. In between feed-forward movements, a PD
controller  is  switched  on  to  keep  the  joints  from  drifting.  Joint  limits  are  also  enforced  by 
stiff  PD  controllers  which  become  active  at  the  boundaries  of  the  joint’s  angle  range. 
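
As a rough illustration of the decoder described above, the following Python sketch maps a control vector to a two-pulse torque profile in either regime. The function names, the back-to-back timing of the two pulses, and the assignment of the boundary case $T_d = T_{crit}$ to PWM are assumptions made for this sketch; the text does not specify them.

```python
import math

def select_regime(t_d, t_crit):
    """Pick the control regime from the desired movement time.

    PWM is used for movement times above the critical time and PHM below it;
    assigning the boundary case to PWM is an assumption of this sketch.
    """
    return "PWM" if t_d >= t_crit else "PHM"

def two_pulse_torque(t, u, regime):
    """Torque at time t (seconds after movement onset) for a two-pulse profile.

    Assumes the second pulse starts immediately when the first one ends.
    """
    if regime == "PHM":
        ph1, ph2, pw = u           # two heights, one shared width
        h1, w1, h2, w2 = ph1, pw, ph2, pw
    elif regime == "PWM":
        pw1, pw2, ph = u           # two signed widths, one shared height magnitude
        h1, w1 = math.copysign(ph, pw1), abs(pw1)
        h2, w2 = math.copysign(ph, pw2), abs(pw2)
    else:
        raise ValueError(f"unknown regime: {regime!r}")
    if 0.0 <= t < w1:
        return h1
    if w1 <= t < w1 + w2:
        return h2
    return 0.0

# Controls quoted in the Figure 7 caption (PHM regime), sampled every 0.05 s.
u_phm = (-61.365, 60.211, 0.0697)
profile = [two_pulse_torque(k * 0.05, u_phm, "PHM") for k in range(4)]
```

Keeping the decoder a pure function of time and the three control parameters mirrors the idea that all dynamics live below the controller, which only has to emit a short vector.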

Figure 4: Closeup of the coordinating controller. The dotted arrow signifies sampled updating.

3.2  Coordinating  Controller 

The  design  of  our  proposed  coordinating  controller  was  also  inspired  by  biology.  Various 
researchers  have  shown  that  the  most  important  piece  of  information  for  humans  learning 
new,  complex  tasks  is  the  relative  timing  between  the  different  movement  segments  or  the 
phasing  between  continuous  movements  (see,  e.g.,  [31];  [28];  [21],  Ch.  1).  Thus,  complex 
skills  can  be  learned  by  combining  more  basic  movement  building  blocks  in  an  appropriate 
way.  It  is  also  the  impression  of  athletes  learning  complex  skills,  like  dives,  that  once  they 
learn  the  basic  building  blocks,  such  as  how  to  start  dive  rotation  and  how  to  pull  out  of  a 
dive,  they  can  learn  different  new  skills  by  simply  learning  how  to  put  the  pieces  together. 

In our design, the coordinating controller takes as input the desired movement parametrization,
a real-valued vector, as well as some state information, and is required to output the
tuning  inputs  for  each  single-DOF  controller  (see  Figure  4).  To  simplify  the  task  of  the 
controller,  we  define  multi-DOF  synergies,  or  behaviors,  appropriate  to  the  desired  class  of 
movements,  such  as  “pike”  or  “throw”  (the  arm  motion  that  initiates  twisting)  for  the  diving 
problem.  The  controller  need  only  specify  the  synergy  s  to  activate,  the  tuning  parameters 
for  one  single-DOF  controller  in  the  synergetic  group,  coupling  parameters  a  determining 
the  relative  amplitudes  of  motion  of  the  other  DOFs  in  the  synergy,  and  the  time  to  wait 
before executing the synergy, $t_s$. To simplify learning (see Section 4), the controller must
activate  only  one  synergy  at  a  time,  and  thus  essentially  acts  like  a  state  machine,  with  the 
states  corresponding  to  the  behaviors  being  executed.  We  have  also  assumed,  for  simplicity, 
that $\Delta\dot{\theta}_d$ will always equal zero for all degrees of freedom. This coupling reduces the number of
degrees  of  freedom  the  coordinator  needs  to  control  directly.  Biological  systems  show  similar 
synergetic coupling. In pointing movements involving both the elbow and the shoulder, for
example, the velocity profiles are identical for movements in which the two joints are required
to  rotate  in  the  same  or  opposite  directions;  only  the  signs  and  relative  amplitudes  change. 
In  the  future,  the  coordinating  controller  may  also  have  some  control  over  the  diver’s  initial 
conditions. 

We  have  initially  chosen  a  movement  representation  (encoder)  which  specifies  the  total 
angle  of  rotation  in  the  somersault  and  twist  directions  (these  are  unambiguous  since  we 
can  assume  that  rotation  in  the  “cartwheel”  direction  will  be  small),  how  tight  a  pike  the 
diver  executed,  the  squared  error  of  the  joint  angles  from  the  desired  final  entry  position, 
and an estimate of the total energy expended in the dive ($\mathbf{y}_p = (\phi_s, \phi_t, k, e, E)$). Other
variations on this type of parametrization are possible, of course (cf. [7]). In the future,
we  may  be  required  to  add  more  parameters  for  stylistic  considerations  or  to  reduce  the 
number  of  possible  solutions.  The  state  information  supplied  to  the  controller  consists  of  the 
current somersault angle, somersault velocity, twist angle, twist velocity, time, and all ten
joint angles ($\mathbf{y}_s = (\phi_s, \dot{\phi}_s, \phi_t, \dot{\phi}_t, t, \boldsymbol{\theta})$). (To reduce the state space, we have assumed that $\dot{\boldsymbol{\theta}}$
is always near zero at the end of the movement, corresponding with our assumption $\Delta\dot{\theta}_d = 0$
above.)  The  state  information  is  updated  only  at  the  completion  of  a  synergetic  motion. 
The  infrequent  update  of  state  information  is  roughly  similar  to  a  diver’s  ability  to  “spot,” 
or take his positional bearings by sighting the water or the board; the diver can only receive
this  information  at  most  once  per  rotation. 

The  controllers  described  here  have  certain  similarities  to  Brockett’s  (u,k,T)  hybrid  motor 
control system [5], but here all dynamics are encapsulated in the lower-level pattern generators,
so the controls are simply vectors rather than time trajectories. Also, the controllers
are  feed-forward  (except  for  the  limited  use  of  PD  controllers  mentioned  above)  at  this  time, 
though  we  plan  to  add  feedback  in  the  future.  Our  control  system  design  has  several  features 
similar  to  Pil  and  Asada’s  recursive  structure  redesign  algorithm  [25]  as  well. 


4  Learning  Algorithms 

In biological systems, when the structure of the system is not known a priori, an internal
model can be built up through learning. The learned model will allow the system to generalize
from known tasks to new tasks. Similarly, in our control scheme, the controllers learn
the  required  controls  in  the  absence  of  a  predefined  model.  In  the  behavioral  literature, 
controllers  like  these  might  be  viewed  as  schemas.  The  original  definition  of  a  schema  is 
simply  a  learned  relationship  between  the  input  and  required  output  vectors  of  the  controller 
[27].  Here,  this  would  correspond  to  a  learned  model  of  the  inverse  relationship  governing  the 
lower levels of the system. Thus the single-DOF controller, for example, is a learned function
$p^{-1}: \mathbb{R}^m \to \mathbb{R}^n$ approximating the inverse of the lumped system of the decoders, the
plant, and the single-DOF encoder, $p: \mathbb{R}^n \to \mathbb{R}^m$. The single-DOF controllers were trained
with  only  one  degree  of  freedom  in  the  plant  free,  and  all  others  fixed. 

4.1  Single-DOF Controllers

The implementation of the learning controllers can be done in several ways. Our current implementation
uses networks of radial basis functions for both the single-DOF controllers and
the coordinating controller (see [16] for an introduction). The output vector $\mathbf{u} = f(\mathbf{v}_d, \mathbf{v}_s)$
of a single-DOF controller, for example, is given by

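$$u_i = f_i(\mathbf{v}_d, \mathbf{v}_s) = \sum_{j=1}^{N} w_{ij}(\mathbf{v}_s)\,\phi\!\left(\frac{\|\mathbf{v}_d - \mathbf{v}_j\|}{\sigma_j}\right) = \Phi^T(\mathbf{v}_d)\,\mathbf{w}_i(\mathbf{v}_s),$$

where $\mathbf{v}_j$ is the center of the $j$th basis function, $\sigma_j$ defines the spread of the $j$th basis function,
$\phi$ is the standard basis function itself, the $w_{ij}(\mathbf{v}_s)$ are weights, and $N$ is the number of basis functions. For our system, we have
chosen $\phi(s) = e^{-s^2/2}$. As discussed above, $\mathbf{v}_s$ defines a two-dimensional space of controllers,
which appears here as the two-dimensional weight functions $w_{ij}(\mathbf{v}_s)$. Our current approach is
to train the controller for fixed values of $\mathbf{v}_s$ lying on a grid, and use functional interpolation
to obtain controllers for $\mathbf{v}_s$ between the grid points. We are currently using radial basis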
functions  with  constant  spread  arranged  in  a  dense,  grid-centered  sphere  packing,  so  this 
interpolation  is  straightforward.  Before  being  input  to  the  plant,  the  controls  are  passed 
through a squashing function $h(u_i)$, so the controls are always within allowed
ranges. With a radial basis function architecture, only one layer is required to approximate
any  function,  whereas  with  a  conventional  neural  network  architecture,  two  are  needed  if 
the  function  is  discontinuous  (see  [17]  for  a  brief  summary).  Thus,  if  the  centers  and  the 
functions  themselves  are  fixed,  a  linear  algorithm  such  as  recursive  least  squares  can  be 
applied  to  the  weights. 
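
To make the linear-in-the-weights structure concrete, here is a minimal Python sketch of the forward pass of such a network for one fixed $\mathbf{v}_s$. It assumes a Gaussian basis function of the form $e^{-s^2/2}$ and a single constant spread; the plain cubic grid of centers, the grid spacing, the spread value, and the example weights are hypothetical simplifications (the text uses a grid-centered sphere packing), and the squashing function is left out.

```python
import itertools
import math

def rbf_features(v_d, centers, sigma):
    """Phi(v_d): one Gaussian activation per basis-function center."""
    return [math.exp(-0.5 * (math.dist(v_d, c) / sigma) ** 2) for c in centers]

def rbf_output(v_d, centers, sigma, weights):
    """u_i = Phi(v_d)^T w_i for each control i; weights[i] is w_i for a fixed v_s."""
    phi = rbf_features(v_d, centers, sigma)
    return [sum(w * p for w, p in zip(w_i, phi)) for w_i in weights]

# Hypothetical 3 x 3 x 3 grid of centers over a normalized input cube.
axis = (0.0, 0.5, 1.0)
centers = list(itertools.product(axis, repeat=3))
sigma = 0.25

# Illustrative weights: control 0 listens only to the first center,
# control 1 sums all activations, control 2 is switched off.
n = len(centers)
weights = [[1.0] + [0.0] * (n - 1), [1.0] * n, [0.0] * n]
u = rbf_output((0.0, 0.0, 0.0), centers, sigma, weights)
```

Because the centers and spreads are fixed, each output is a dot product between a feature vector and a weight vector, which is what allows a linear estimator to be used for training.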

In  our  implementation,  the  situation  is  a  bit  more  complicated.  To  use  recursive  least 
squares,  one  needs  to  obtain  an  error  measure  on  the  u  produced  by  the  controller.  Here, 
all  we  have  available  is  the  error  on  the  plant  output.  To  get  around  this  problem  for  the 
single-DOF controllers, we have adopted the scheme shown in Figure 5, with $\mathbf{v}_s$ fixed. The
algorithm  can  be  summarized  as  follows: 

1. A random $\mathbf{v}_d$ within the controller’s effective range is generated and passed to the
controller. 

2.  The  controller  produces  output  u  in  response  to  its  input,  and  this  control  is  passed 
through  the  squashing  function. 

3.  The  decoder  produces  a  torque  profile  corresponding  to  u,  and  a  single  DOF  of  the 
plant  (all  others  held  fixed)  is  simulated  with  that  control. 

4. The single-DOF encoder converts the plant output into a vector in $\mathbb{R}^m$, $\mathbf{v}_p = (\Delta\theta, \Delta\dot{\theta}, T)$.
($\mathbf{v}_p$, $\mathbf{u}$) is a valid training pair for the network.

5. $\mathbf{v}_p$ is fed back into the controller as a new input, $\mathbf{v}_d'$. If $\mathbf{v}_d'$ is outside the range of the
controller, the trial is aborted.

6. The controller produces another output, $\mathbf{u}'$, in response to the new input.

7. The error between $\mathbf{u}'$ and $\mathbf{u}$ is used with a recursive least squares algorithm to adjust
the weights in the radial basis function network.

Figure 5: On-line training for the single-DOF controller.

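The recursive least squares update we are using, for each control $u_i$, is:

$$\begin{aligned}
e(n+1) &= u_i(n) - u_i'(n) = u_i(n) - \Phi^T(n)\,\mathbf{w}_i(n) \\
\mathbf{k}(n+1) &= \frac{P(n)\,\Phi(n)}{1 + \Phi^T(n)\,P(n)\,\Phi(n)} \\
\mathbf{w}_i(n+1) &= \mathbf{w}_i(n) + \mathbf{k}(n+1)\,e(n+1) \\
P(n+1) &= P(n) - \mathbf{k}(n+1)\,\Phi^T(n)\,P(n)
\end{aligned}$$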


P(0)  is  set  to  the  identity  plus  small  random  perturbations  along  the  diagonal. 

The recursive least squares algorithm acts to minimize $\|\mathbf{u} - \mathbf{u}'\|$, which implies, by definition,
that $\|f(\mathbf{v}_d, \mathbf{v}_s) - f(\mathbf{v}_d', \mathbf{v}_s)\|$ is also minimized. The structure of the radial basis function
net certainly would permit $f(\mathbf{v}_d', \mathbf{v}_s) \to f(\mathbf{v}_d, \mathbf{v}_s)$ without $\mathbf{v}_d' \to \mathbf{v}_d$, but since the function
we are trying to learn is injective, we can try to get the controller to converge to the
desired  fixed  point  by  setting  the  initial  weights  using  least  squares  on  an  initial  dataset. 
We  generate  this  dataset  by  simulating  randomly  generated  controls  (within  the  restricted 
control  ranges)  on  the  single-DOF  plant,  and  then  pruning  to  remove  trials  in  which  the 
outcomes  were  far  outside  the  desired  velocity  range  or  had  come  up  against  the  joint  limits. 
The effective range of the controller is estimated by the spread of the $\mathbf{v}_p$ values produced in this
dataset.  The  larger  the  initial  dataset,  the  more  representative  this  estimate  will  be  of  the 
true  range  of  outcomes  achievable  with  the  allowed  controls.  If  the  joint  runs  up  against  the 
joint  limits  during  on-line  learning,  a  virtual  error  estimating  the  overshoot  prevented  by  the 
joint  limit  is  added  to  the  position  error,  to  facilitate  the  learning. 

Our  simulations  use  the  SD/FAST  software  package  [19]  with  a  three-dimensional  diver 
model  generously  shared  with  us  by  Jessica  Hodgins  (see  [32]).  Preliminary  simulations 
on  the  single-DOF  controllers  are  promising.  Figure  6  shows  the  squared  errors  in  the 
controls $\mathbf{u}$ and the plant output $\mathbf{v}_p$ for online training of a PHM controller for shoulder
abduction/adduction with $\mathbf{v}_s = (0, 0)$. The network has 739 basis functions with $\sigma = 0.06$.
The basis functions are arranged in a grid-centered sphere packing so that the closest distance
between basis function centers is $\sqrt{2}\sigma$, or $2\sigma$ along the grid axes. The network was initialized
with  a  dataset  containing  1439  elements  after  pruning.  Figure  7  shows  a  typical  movement 
produced  by  this  controller. 

As  can  be  seen  in  Figure  6,  the  errors  converge,  but  a  rather  large  error  in  the  final 
velocity  remains.  This  error  is  smaller  with  larger  networks  on  finer  grids,  but  this  brings  us 
up  against  the  curse  of  dimensionality  common  with  locally-acting  approximators:  to  double 
the  number  of  basis  functions  along  each  dimension,  we  must  increase  the  total  number  of 
basis  functions  in  the  controller  by  a  factor  of  eight.  We  can  also  expect  convergence  to 
be  slower  with  larger  networks.  The  situation  is  still  worse  because  of  the  two  dimensions 
added by requiring a different set of weights for each $\mathbf{v}_s$. The computation and storage
required  for  this  scheme  can  quickly  become  huge.  Global  approximation  methods  such  as 
neural  networks  do  not  suffer  from  the  curse  of  dimensionality  to  such  an  extent  as  this, 
but  they  are  generally  trained  with  local  gradient  methods,  which  cannot  guarantee  a  global 
solution.  Techniques  such  as  covariance  reset,  which  revitalize  the  recursive  least  squares 
training  algorithm,  may  improve  the  convergence,  however,  and  possibly  allow  us  to  use 
smaller  networks.  It  is  clear  that  much  remains  to  be  done  in  terms  of  exploring  systems 
such  as  these  and  investigating  new  controller  designs. 
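
The scaling argument above amounts to simple arithmetic, sketched here in Python. The helper name is hypothetical, and the five-dimensional case assumes that the grid over the two components of $\mathbf{v}_s$ is refined along with the three controller inputs, which the text does not state explicitly.

```python
def growth_factor(refinement, dims):
    """Factor by which a grid of basis functions grows when the number of
    basis functions along each of `dims` axes is multiplied by `refinement`."""
    return refinement ** dims

# Doubling the resolution over the three controller inputs.
three_input_factor = growth_factor(2, 3)

# Doubling the resolution over the two sensor dimensions as well (assumed).
with_sensor_axes = growth_factor(2, 3 + 2)
```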

4.2  Coordinating  Controller 

For  the  coordinating  controller,  a  similar  learning  scheme  may  also  be  possible.  However,  a 
more fruitful approach may be that of reinforcement learning. Good surveys of reinforcement
learning can be found in [26] and [22]. In reinforcement learning, the output error $\|\mathbf{y}_p - \mathbf{y}_d\|$
is  minimized  directly,  and  the  learning  can  be  distributed  over  sequences  of  actions.  One 
variant  which  does  not  require  a  system  model  is  Q-learning.  A  function  Q  is  defined  for 
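each state $\mathbf{y} = (\mathbf{y}_d, \mathbf{y}_s)$ and each control action $\mathbf{v}$ as

$$Q(\mathbf{y}, \mathbf{v}) = R(\mathbf{y}) + \max_{\mathbf{v}'} Q(\mathbf{y}', \mathbf{v}')$$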

where  y'  is  the  successor  state  to  y  under  v  and  R  is  the  reward  accrued  in  each  state.  For 
the  diver  problem,  no  reward  is  accrued  until  the  end  of  the  dive,  when  the  diver  reaches  the 
water. Then the reward is based on the error between $\mathbf{y}_p$ and $\mathbf{y}_d$. The radial basis function
network  is  required  to  learn  an  approximation  to  the  Q  function  for  the  system,  with  the 
error,  or  temporal  difference, 

$$d = R(\mathbf{y}) + \max_{\mathbf{v}'} Q(\mathbf{y}', \mathbf{v}') - Q(\mathbf{y}, \mathbf{v})$$

being  used  at  each  transition  for  recursive  least  squares  update  of  the  network  weights.  Thus 
the  values  of  the  final  states  are  propagated  backward  to  earlier  states.  Such  an  approach, 
using  dynamic  programming  ideas  with  a  network  approximation  of  the  Q  or  value  function, 
has been called neuro-dynamic programming [3]. A separate action selector would select the
next action taken by the controller based on a balance between maximizing the Q function
over all the possible control actions (an optimal policy) and exploring the action space.
Since the desired movement parametrization appears in the input to the controller, the same
network can be trained to produce several different dives. We are currently implementing
this reinforcement learning algorithm for training the coordinating controller.

[Figure 6 panels: Error in Controls (top); Error in Plant Output (bottom)]

Figure 6: Squared errors averaged over groups of 100 trials for training a PHM controller for
shoulder abduction/adduction. The top plot shows the squared errors in $\mathbf{u}$ before squashing
(with $\beta = .8$). After squashing, each control is a scaled value in (0, 1). The solid line
represents $pw$, the dashed line $ph_1$, and the dash-dot line $ph_2$. The bottom plot shows the
squared errors in $\mathbf{v}_p$; the solid line represents $\Delta\theta$ (squared error multiplied by a factor of
ten for visibility), the dashed line $\Delta\dot{\theta}$, and the dash-dot line $T$ (multiplied by 100). $\theta$ is
measured in radians, $\dot{\theta}$ in radians per second, and $T$ in seconds.

Figure 7: A movement produced by the trained PHM single-DOF controller. The input is
$\Delta\theta_d$ = -|, $\Delta\dot{\theta}_d = 0$, $T_d = .15$, $\dot{\theta}_0 = 0$, $\tau = 0$. The controls are $ph_1 = -61.365$, $ph_2 = 60.211$,
$pw = .0697$, and the plant output is $\Delta\theta = -.687$, $\Delta\dot{\theta} = -.132$, $T = .15$.

For  systems  in  which  the  state  space  is  discrete,  and  the  Q  values  can  be  stored  in  a  table, 
if  the  states  contain  sufficient  information  that  the  system  is  Markov,  then  the  Q-learning 
algorithm  converges  (see  [1],  [20],  [29]).  For  a  system  like  the  diver,  however,  where  we  require 
a  function  approximator,  convergence  is  more  problematic.  Boyan  and  Moore  [4]  give  simple 
examples  in  which  substituting  a  function  approximator  for  a  lookup  table  results  in  loss 
of  convergence.  These  examples  can  be  made  to  converge  by  changing  slightly  the  methods 
used,  however;  see  [22]  for  a  summary.  Certain  types  of  function  approximators  have  been 
shown  to  guarantee  convergence  when  combined  with  dynamic  programming  techniques;  in 
particular,  neural  network  approximators  may  not  converge,  but  certain  linear  interpolation 
approximators will [12], as will some feature-based methods (including radial basis function
networks)  satisfying  certain  properties,  under  a  modified  dynamic  programming  algorithm 
[30].  It  is  our  hope  that  these  results  can  be  extended  for  our  radial  basis  function  networks 
with  reinforcement  learning  algorithms  similar  to  the  Q-learning  algorithm  described  above. 
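
For the discrete, table-based case mentioned above, the temporal-difference update can be sketched as follows. This is a toy illustration, not the diver problem: the three-step task, the binary actions, the terminal reward (negative distance between the achieved and desired totals), the learning rate, and the exploration rate are all hypothetical choices, and a lookup table replaces the radial basis function network.

```python
import random

def train_q_table(desired_total, n_steps=3, episodes=2000,
                  alpha=0.5, epsilon=0.3, seed=0):
    """Tabular Q-learning with reward only at the end of each episode."""
    rng = random.Random(seed)
    Q = {}  # maps (state, action) -> value; state = (step, running_total)

    def q(state, action):
        return Q.get((state, action), 0.0)

    for _ in range(episodes):
        state = (0, 0)
        while state[0] < n_steps:
            # Action selector: explore with probability epsilon, else be greedy.
            if rng.random() < epsilon:
                action = rng.choice((0, 1))
            else:
                action = max((0, 1), key=lambda a: q(state, a))
            nxt = (state[0] + 1, state[1] + action)
            if nxt[0] == n_steps:
                # Terminal transition: reward based on the final error.
                target = -abs(nxt[1] - desired_total)
            else:
                # No intermediate reward; back up the best successor value.
                target = max(q(nxt, 0), q(nxt, 1))
            d = target - q(state, action)          # temporal difference
            Q[(state, action)] = q(state, action) + alpha * d
            state = nxt
    return Q

def greedy_rollout(Q, n_steps=3):
    """Follow the greedy policy and return the achieved total."""
    state = (0, 0)
    while state[0] < n_steps:
        action = max((0, 1), key=lambda a: Q.get((state, a), 0.0))
        state = (state[0] + 1, state[1] + action)
    return state[1]

achieved = greedy_rollout(train_q_table(desired_total=2))
```

The terminal reward is propagated backward to earlier decisions through the successor-value term, which is the mechanism the function approximator would have to reproduce in the continuous setting.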

Conclusions


The  hybrid,  hierarchical,  learning  control  structure  biological  motor  control  systems  use 
to  deal  with  system  complexity  and  unknown  models  can  provide  inspiration  for  tackling 
difficult  control  problems.  In  designing  a  hybrid  control  structure,  often  the  most  critical 
pieces  are  the  decoder  and  encoder.  Biological  systems  suggest  pattern  generators  as  models 
for  the  decoder,  and  suggest  using  desired  features  of  the  movement  (rather  than  a  specific 
desired  trajectory)  to  create  the  encoder.  We  have  designed  a  learning  control  structure 
using  these  ideas  and  are  testing  it  on  the  diving  problem.  The  single-DOF  controllers  play 
the  role  of  pattern  generators  in  the  controller,  restricting  the  allowed  torque  profiles  to 
a  family  of  two-pulse  controls.  The  coordinating  controller  provides  the  tuning  inputs  to 
the  single-DOF  controllers  based  on  the  type  of  dive  that  is  desired.  For  the  lower-level 
controllers,  a  modified  form  of  supervised  learning  can  be  applied,  but  for  the  coordinating 
controller,  reinforcement  learning  is  more  appropriate.  We  believe  that  new  approaches  such 
as the learning controller presented here will be essential to making headway on difficult
behavioral  control  problems;  much  work  remains  to  be  done  in  this  area. 